Redesigning the message logging model for high performance
Authors
Abstract
Over the past decade, the number of processors used in high performance computing has increased to hundreds of thousands. As a direct consequence, and although computational power has followed the same trend, the mean time between failures (MTBF) has suffered and is now counted in hours. To circumvent this limitation, a number of fault-tolerant algorithms and execution environments have been developed using the message passing paradigm. Among them, message logging has been shown to achieve better overall performance when the MTBF is low, mainly due to faster failure recovery. However, message logging suffers from a high overhead when no failure occurs. Therefore, in this paper we discuss a refinement of the message logging model intended to improve failure-free message logging performance. The proposed approach simultaneously removes useless memory copies and reduces the number of logged events. We present the implementation of a pessimistic message logging protocol in Open MPI and compare it with the previous reference implementation, MPICH-V2. The results show an improvement in performance of several orders of magnitude and zero overhead for most messages. Published in 2010 by John Wiley & Sons, Ltd.
Similar resources
Scalable Message-Logging Techniques for Effective Fault Tolerance in HPC Applications
An important set of challenges emerge as the High Performance Computing (HPC) community aims to reach extreme scale. Resilience and energy consumption are two of those challenges. Extreme-scale machines are expected to have a high failure frequency. This is an inevitable consequence of the mismatch between two trends. The number of components assembled in supercomputers grows exponentially. How...
Design an Efficient Community-based Message Forwarding Method in Mobile Social Networks
Mobile social networks (MSNs) are a special type of Delay tolerant networks (DTNs) in which mobile devices communicate opportunistically to each other. One of the most challenging issues in Mobile Social Networks (MSNs) is to design an efficient message forwarding scheme that has a high performance in terms of delivery ratio, latency and communication cost. There are two different approaches fo...
On fault tolerance, performance, and reliability for wireless and sensor networks
The emerging mobile wireless environment poses exciting challenges for distributed fault-tolerant (FT) computing. This thesis develops a message logging and recovery protocol on the top of Wireless CORBA to complement FT–CORBA specified for wired networks. It employs the storage available at access bridge (AB) as the stable storage for logging messages and saving checkpoints on behalf of mobile...
Why Optimistic Message Logging Has Not Been Used In
Much of the literature on message logging and checkpointing in the past decade has been based on a so-called optimistic approach [1] that places more emphasis on failure-free overhead than recovery efficiency. Our experience has shown that most telecommunications systems use a pessimistic approach because the main purpose of using message logging and checkpointing is to achieve fast and localized...
Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols
With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault tolerant; most are in need of a seamless recovery framework. Among the automatic fault tolerant techniques proposed for MPI, message logging is preferable for its scalable recovery. The major challenge for message logging protocols i...
Journal title:
- Concurrency and Computation: Practice and Experience
Volume 22, Issue
Pages -
Publication date 2010